Can Clustered File Systems Support Data Intensive Applications?

Authors

  • Rajagopal Ananthanarayanan
  • Karan Gupta
  • Prashant Pandey
  • Himabindu Pucha
  • Prasenjit Sarkar
  • Mansi Shah
  • Renu Tewari
Abstract

This WIP attempts to address the question: can cluster file systems match specialized file systems such as Google's GFS for data-intensive applications? With the explosive growth of information and of applications exploiting that information, large-scale data processing has emerged as an important challenge. Example applications include web search, indexing and mining, discovering biological functions from genomic sequences, detecting astronomical phenomena in telescope imagery, and building brain-scale networks for cognitive systems. These data-intensive applications demand a scalable yet cost-effective storage layer. The storage layer needs to scale to thousands of nodes so that highly parallel applications can process petabytes of data in hours rather than days. At the same time, the infrastructure needs to be built on commodity components to minimize cost while tolerating the failures that are typical of such components. Given the large volumes of data being processed, another key requirement for this storage layer is to enable shipping compute to the data rather than the other way around.

Recently, enterprises faced with these critical needs have proposed specialized file systems built with the unique requirements of this layer in mind. For example, Google developed GFS, which is optimized for large sequential and small random reads on a small number of large files residing on a commodity cluster. Companies such as Yahoo and Kosmix followed this trend by emulating the GFS architecture in Hadoop DFS and KFS, respectively. For the scope of this work, we choose the open-source Hadoop DFS (HDFS) as a representative specialized file system.

This work argues that cluster file systems can also rise to the challenges posed by these data-intensive applications. Moreover, there are inherent advantages to using cluster file systems in this paradigm: (1) they can provide well-known traditional file APIs to this new class of applications; (2) having been in production for years, they come with a rich set of management tools, such as automated backup and disaster recovery; (3) they can simultaneously support legacy applications that rely on traditional file APIs, obviating the need to maintain different storage layers for different applications; and (4) an interesting trend further motivates this study: enterprises are increasingly incorporating data analytics into their workflows, resulting in a mix of legacy applications and the new class of data-intensive applications accessing a common storage layer. There is ample evidence that existing cluster file …
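To make the "ship compute to data" requirement concrete, the sketch below (an illustration, not code from the paper) shows how a locality-aware scheduler can ask HDFS which hosts hold the replicas of each block of a file and place tasks on those hosts. It assumes the standard Hadoop FileSystem API; the input path is hypothetical.

    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.fs.BlockLocation;
    import org.apache.hadoop.fs.FileStatus;
    import org.apache.hadoop.fs.FileSystem;
    import org.apache.hadoop.fs.Path;

    public class BlockLocality {
        public static void main(String[] args) throws Exception {
            // Connect to the file system named in the Hadoop configuration.
            Configuration conf = new Configuration();
            FileSystem fs = FileSystem.get(conf);

            Path input = new Path("/data/input.log");  // hypothetical path
            FileStatus status = fs.getFileStatus(input);

            // One BlockLocation per block, listing the hosts that hold a
            // replica. A locality-aware scheduler would run each task on
            // one of the hosts serving the block the task reads.
            BlockLocation[] blocks =
                fs.getFileBlockLocations(status, 0, status.getLen());
            for (BlockLocation b : blocks) {
                System.out.printf("offset=%d length=%d hosts=%s%n",
                    b.getOffset(), b.getLength(),
                    String.join(",", b.getHosts()));
            }
            fs.close();
        }
    }

On a cluster file system such as GPFS mounted through the normal VFS path, the same data would additionally remain reachable through ordinary POSIX I/O, which is the compatibility advantage that points (1) and (3) above rely on.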

Similar Resources

Checkpointing Orchestration for Performance Improvement

Checkpointing is a widely used mechanism for supporting fault tolerance in high-performance computing (HPC), but it is notorious for its expensive disk accesses. Parallel file systems such as Lustre, GPFS, and PVFS are widely deployed on supercomputers to provide high I/O bandwidth for general data-intensive applications. However, the unique characteristics of checkpointing make it impossible to benefit from the ...

High-Performance Storage Support for Scientific Big Data Applications on the Cloud

This work studies the storage subsystem for scientific big data applications running on the cloud. Although cloud computing has become one of the most popular paradigms for executing data-intensive applications, its storage subsystem has not been optimized for scientific applications. In particular, many scientific applications were originally developed assuming a tightly-coupled cluster ...

Towards a Next Generation Distributed Middleware System for Many-Task Computing

Distributed computing systems have evolved over decades to support various types of scientific applications, and the overall computing paradigms have been categorized into HTC (High-Throughput Computing), which supports bags of tasks that are usually long-running, and HPC (High-Performance Computing), for processing tightly-coupled, communication-intensive tasks on top of dedicated clusters of workstations or...

Panache: A Parallel File System Cache for Global File Access

Cloud computing promises large-scale and seamless access to vast quantities of data across the globe. Applications will demand the reliability, consistency, and performance of a traditional cluster file system regardless of the physical distance between data centers. Panache is a scalable, high-performance, clustered file system cache for parallel data-intensive applications that require wide a...

Data-intensive file systems for Internet services: A rose by any other

Data-intensive distributed file systems are emerging as a key component of large scale Internet services and cloud computing platforms. They are designed from the ground up and are tuned for specific application workloads. Leading examples, such as the Google File System, Hadoop distributed file system (HDFS) and Amazon S3, are defining this new purpose-built paradigm. It is tempting to classif...

Publication Date: 2009